Efficient mining of distance-based subspace clusters

نویسندگان

  • Guimei Liu
  • Kelvin Sim
  • Jinyan Li
  • Limsoon Wong
چکیده

Traditional similarity measurements often become meaningless when dimensions of datasets increase. Subspace clustering has been proposed to find clusters embedded in subspaces of high dimensional datasets. Many existing algorithms use a grid based approach to partition the data space into nonoverlapping rectangle cells, and then identify connected dense cells as clusters. The rigid boundaries of the grid based approach may cause a real cluster to be divided into several small clusters. In this paper, we propose to use a sliding window approach to partition the dimensions to preserve significant clusters. We call this model nCluster model. The sliding window approach generates more bins than the gridbased approach, thus it incurs higher mining cost. We develop a deterministic algorithm, called MaxnCluster, to mine nClusters efficiently. MaxnCluster uses several techniques to speed up the mining, and it produces only maximal nClusters to reduce result size. Non-maximal nClusters are pruned without the need of storing the discovered nClusters in the memory, which is key to the efficiency of MaxnCluster. Our experiment results show that MaxnCluster can produce maximal nClusters efficiently and accurately.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Adapting K-Means Algorithm for Discovering Clusters in Subspaces

Subspace clustering is a challenging task in the field of data mining. Traditional distance measures fail to differentiate the furthest point from the nearest point in very high dimensional data space. To tackle the problem, we design minimal subspace distance which measures the similarity between two points in the subspace where they are nearest to each other. It can discover subspace clusters...

متن کامل

Density-Connected Subspace Clustering for High-Dimensional Data

Several application domains such as molecular biology and geography produce a tremendous amount of data which can no longer be managed without the help of efficient and effective data mining methods. One of the primary data mining tasks is clustering. However, traditional clustering algorithms often fail to detect meaningful clusters because most real-world data sets are characterized by a high...

متن کامل

Subspace Clustering of Microarray Data Based on Domain Transformation

We propose a mining framework that supports the identification of useful knowledge based on data clustering. With the recent advancement of microarray technologies, we focus our attention on gene expression datasets mining. In particular, given that genes are often coexpressed under subsets of experimental conditions, we present a novel subspace clustering algorithm. In contrast to previous app...

متن کامل

An Efficient Method for Finding Closed Subspace Clusters for High Dimensional Data

Subspace clustering tries to find groups of similar objects from the given dataset such that the objects are projected on only a subset of the feature space. It finds meaningful clusters in all possible subspaces. However, when it comes to the quality of the resultant subspace clusters most of the subspace clusters are redundant. These redundant subspace clusters don’t provide new information. ...

متن کامل

Title: Subspace Clustering of Microarray Data based on Domain Transformation

We propose a mining framework that supports the identification of useful knowledge based on data clustering. With the recent advancement of microarray technologies, we focus our attention on gene expression datasets mining. In particular, given that genes are often coexpressed under subsets of experimental conditions, we present a novel algorithm on subspace clustering. In contrast to previous ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Statistical Analysis and Data Mining

دوره 2  شماره 

صفحات  -

تاریخ انتشار 2009